135 research outputs found

    Efficient Computation of Subspace Skyline over Categorical Domains

    Platforms such as AirBnB, Zillow, Yelp, and related sites have transformed the way we search for accommodation, restaurants, etc. The underlying datasets in such applications have numerous attributes that are mostly Boolean or categorical. Discovering the skyline of such datasets over a subset of attributes would identify entries that stand out while enabling numerous applications. Only a few algorithms are designed to compute the skyline over categorical attributes, and they are applicable only when the number of attributes is small. In this paper, we place the problem of skyline discovery over categorical attributes into perspective and design efficient algorithms for two cases. (i) In the absence of indices, we propose two algorithms, ST-S and ST-P, that exploit the categorical characteristics of the datasets, organizing tuples in a tree data structure and supporting efficient dominance tests over the candidate set. (ii) We then consider the existence of widely used precomputed sorted lists. After discussing several approaches and studying their limitations, we propose TA-SKY, a novel threshold-style algorithm that utilizes sorted lists. Moreover, we further optimize TA-SKY and explore its progressive nature, making it suitable for applications with strict interactive requirements. In addition to the extensive theoretical analysis of the proposed algorithms, we conduct a comprehensive experimental evaluation on a combination of real (including the entire AirBnB data collection) and synthetic datasets to study the practicality of the proposed algorithms. The results showcase the superior performance of our techniques, outperforming applicable approaches by orders of magnitude.
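To make the notion of a subspace skyline concrete, here is a minimal sketch of the underlying dominance test using a naive quadratic scan. It is not ST-S, ST-P, or TA-SKY, whose details the abstract does not give. The function names, the tuple encoding, and the convention that larger values are preferred are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Row = Tuple[int, ...]


def dominates(a: Row, b: Row, attrs: Sequence[int]) -> bool:
    """Return True if `a` dominates `b` on the attribute subset `attrs`.

    Assumption: larger values are preferred (e.g. 1 = amenity present).
    `a` dominates `b` when it is at least as good on every chosen
    attribute and strictly better on at least one.
    """
    at_least_as_good = all(a[i] >= b[i] for i in attrs)
    strictly_better = any(a[i] > b[i] for i in attrs)
    return at_least_as_good and strictly_better


def subspace_skyline(rows: List[Row], attrs: Sequence[int]) -> List[Row]:
    """Naive O(n^2) skyline: keep every row that no other row dominates."""
    return [r for r in rows if not any(dominates(o, r, attrs) for o in rows)]


# Hypothetical Boolean listings: (has_wifi, has_parking, has_pool)
listings = [(1, 1, 0), (1, 0, 1), (0, 0, 1), (1, 1, 0)]
skyline_wifi_parking = subspace_skyline(listings, attrs=(0, 1))
skyline_all = subspace_skyline(listings, attrs=(0, 1, 2))
```

Restricting the test to `attrs` is what makes this a subspace skyline: a row that is incomparable over all attributes can still be dominated once only a subset is considered.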

    Holistic Twig Joins: Optimal XML Pattern Matching

    XML employs a tree-structured data model, and, naturally, XML queries specify patterns of selection predicates on multiple elements related by a tree structure. Finding all occurrences of such a twig pattern in an XML database is a core operation for XML query processing. Prior work has typically decomposed the twig pattern into binary structural (parent-child and ancestor-descendant) relationships, and twig matching is achieved by: (i) using structural join algorithms to match the binary relationships against the XML database, and (ii) stitching together these basic matches. A limitation of this approach for matching twig patterns is that intermediate result sizes can get large, even when the input and output sizes are more manageable. In this paper, we propose a novel holistic twig join algorithm, TwigStack, for matching an XML query twig pattern. Our technique uses a chain of linked stacks to compactly represent partial results to root-to-leaf query paths, which are then composed to obtain matches for the twig pattern. When the twig pattern uses only ancestor-descendant relationships between elements, TwigStack is I/O and CPU optimal among all sequential algorithms that read the entire input: it is linear in the sum of sizes of the input lists and the final result list, but independent of the sizes of intermediate results. We then show how to use (a modification of) B-trees, along with TwigStack, to match query twig patterns in sub-linear time. Finally, we complement our analysis with experimental results on a range of real and synthetic data, and query twig patterns.

    CERTEM: Explaining and Debugging Black-box Entity Resolution Systems with CERTA

    Entity resolution (ER) aims at identifying record pairs that refer to the same real-world entity. Recent works have focused on deep learning (DL) techniques to solve this problem. While such works have brought tremendous enhancements in terms of effectiveness in solving the ER problem, understanding their matching predictions is still a challenge because of the intrinsic opaqueness of DL-based solutions. Interpreting and trusting the predictions made by ER systems is crucial for humans in order to employ such methods in decision-making pipelines. We demonstrate CERTEM, an explanation system for ER based on CERTA, a recently introduced explainability framework for ER that is able to provide both saliency explanations, which associate each attribute with a saliency score, and counterfactual explanations, which provide examples of values that can flip a prediction. In this demonstration we will showcase how CERTEM can be effectively employed to better understand and debug the behavior of state-of-the-art DL-based ER systems on data from publicly available ER benchmarks.

    Early online identification of attention gathering items in social media

    Activity in social media such as blogs, micro-blogs, social networks, etc., is manifested via interaction that involves text, images, links and other information items. Naturally, some items attract more attention than others, expressed with large volumes of linking, commenting or tagging activity, to name a few examples. Moreover, high attention can be indicative of emerging events, breaking news or generally indicate information items of interest to a vast set of people. The numbers associated with digital social activity are astonishing: in excess of millions of blog posts, tweets and forum updates per day, millions of tags in photos, news articles or blogs. Being able to identify information items that gather much attention in such a real-time information collective is a challenging task. In this paper, we consider the problem of early online identification of items that gather a lot of attention in social media. We model social media activity using ISIS, a stochastic model for Interacting Streaming Information Sources, that intuitively captures the concept of attention gathering information items. Given the challenge of the information overload characterizing digital social activity, we present sequential statistical tests that enable early identification of attention gathering items. This effectively reduces the set of items one has to monitor in real time in order to identify pieces of information attracting a lot of attention. Experiments on real data demonstrate the utility of our model, as well as the efficiency and effectiveness of the proposed sequential statistical tests.
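The abstract does not specify the sequential tests or the ISIS model, so the following is only a generic illustration of how a sequential test can reach an early decision: Wald's sequential probability ratio test (SPRT) on a Bernoulli stream. The function name, the labels, the rates `p0` and `p1`, and the reading of each observation as "the item received an interaction in this time step" are all assumptions, not the paper's method.

```python
import math
from typing import Iterable, Tuple


def sprt_bernoulli(
    observations: Iterable[int],
    p0: float,
    p1: float,
    alpha: float = 0.05,
    beta: float = 0.05,
) -> Tuple[str, int]:
    """Wald's SPRT for H0: rate = p0 versus H1: rate = p1 (with p1 > p0).

    Each observation is 1 if the item drew an interaction in that time
    step and 0 otherwise (a hypothetical encoding). Returns the decision
    and the number of observations consumed before stopping.
    """
    upper = math.log((1 - beta) / alpha)   # cross this: accept H1
    lower = math.log(beta / (1 - alpha))   # cross this: accept H0
    llr = 0.0                              # running log-likelihood ratio
    n = 0
    for n, x in enumerate(observations, start=1):
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "attention", n
        if llr <= lower:
            return "ordinary", n
    return "undecided", n


# A burst of interactions triggers an early "attention" decision.
burst_decision = sprt_bernoulli([1, 1, 1, 1], p0=0.1, p1=0.5)
quiet_decision = sprt_bernoulli([0] * 10, p0=0.1, p1=0.5)
```

Because the test stops as soon as either threshold is crossed, items that are clearly ordinary can be dropped from monitoring early, which is the kind of reduction in the monitored set that the abstract describes.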